Abstract

Labeling speech signals is a critical activity in the early phases of designing any system based on speech technology. To this end, an efficient particle swarm optimization (PSO)-based clustering algorithm is proposed to classify speech into three classes: voiced, unvoiced, and silence. A sample of 10 signal waves is selected, and their audio features are extracted. The audio signals are then partitioned into frames, and each frame is classified using the proposed PSO-based clustering algorithm. The performance of the proposed algorithm is evaluated using metrics such as accuracy, sensitivity, and specificity. Extensive experiments reveal that the proposed algorithm outperforms competitive algorithms: its average accuracy is 97%, sensitivity is 98%, and specificity is 96%, which shows that the proposed approach is efficient in detecting and classifying the speech classes.

1. Introduction

The classification of speech into voiced, unvoiced, and silence (V/UV/S) frames is a critical and difficult task that underpins pitch estimation, automatic speech recognition, speaker identification, speech analysis, speech enhancement, and speech signal compression, and it rests on whether or not the vocal cords vibrate during the creation of a speech segment [1]. The silence segment in human speech is a period of quiet that may occur at the start of utterances, between words/syllables, or after utterances. Unvoiced segments are generated when the vocal cords vibrate aperiodically [2], and voiced segments are generated when the vocal cords vibrate in a regular pattern. The silence/unvoiced/voiced (SUV) segmentation is a more difficult classification problem than the two-class voice activity detection (VAD) and voiced-unvoiced (V/U) classifications since it is a three-class problem. Research shows that SUV segmentation can be accomplished by combining VAD and V/U segmentations; however, this requires prior knowledge of the speech signal’s noise statistics, making the classification dependent on the accuracy of those statistics. As a result, SUV segmentation is often treated as a single problem.

1.1. Preliminaries

The preliminaries of the research are the parameters of the speech that are to be extracted for classification. The five parameters of speech are as follows (https://www.clear.rice.edu/elec532/PROJECTS00/vocode/uv/uvdet.html).

1.2. Zero Crossing

The zero-crossing count indicates the frequency at which the energy in the signal spectrum is concentrated [3]. Voiced speech is produced by a periodic flow of air at the glottis exciting the vocal tract and, in general, has a low zero-crossing count. Unvoiced speech arises when a noise-like source excites the vocal tract at a constriction in its interior, giving a high zero-crossing count [4]. Silence is expected to have a zero-crossing rate lower than that of unvoiced speech but comparable to that of voiced speech.
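As an illustration, the zero-crossing count of a frame can be computed in a few lines of NumPy. The sampling rate, frame length, and test tones below are illustrative choices, not values from the paper:

```python
import numpy as np

def zero_crossing_count(frame):
    """Count sign changes between consecutive samples in one frame."""
    signs = np.sign(frame)
    signs[signs == 0] = 1  # treat exact zeros as positive
    return int(np.sum(signs[:-1] != signs[1:]))

# A low-frequency (voiced-like) tone crosses zero far less often than a
# high-frequency (unvoiced-like) tone of the same duration.
fs = 8000
t = np.arange(0, 0.02, 1 / fs)        # one 20 ms frame
low = np.sin(2 * np.pi * 100 * t)     # 100 Hz: about 4 crossings
high = np.sin(2 * np.pi * 2000 * t)   # 2 kHz: about 80 crossings
```

A sinusoid crosses zero twice per period, so the count scales directly with the dominant frequency of the frame.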

1.3. Energy

The log energy is defined as

E_s = 10 log10( ε + (1/N) Σ_{n=1}^{N} s²(n) ),

where ε is a tiny positive constant that prevents the log of zero from being computed, s(n) are the speech samples, and N is the number of samples in the frame. The energy of spoken data is substantially greater than that of silence.
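A minimal sketch of this computation follows; the value of ε and the frame length are illustrative assumptions:

```python
import numpy as np

EPS = 1e-10  # tiny positive constant so log10 never sees zero

def log_energy(frame):
    """10*log10(eps + mean squared amplitude) of one frame, in dB."""
    return 10 * np.log10(EPS + np.mean(frame ** 2))

rng = np.random.default_rng(0)
speech_like = 0.5 * rng.standard_normal(160)    # active frame
near_silence = 0.001 * rng.standard_normal(160)  # near-silent frame
```

With ε = 1e-10, an all-zero frame bottoms out at exactly −100 dB instead of raising an error.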

1.4. Normalized Autocorrelation Coefficient

The normalized autocorrelation coefficient at a unit sample delay is defined as

C_1 = Σ_{n=1}^{N−1} s(n) s(n−1) / sqrt( (Σ_{n=1}^{N−1} s²(n)) (Σ_{n=1}^{N−1} s²(n−1)) ).

This metric measures the correlation between adjacent speech samples. Because adjacent samples of voiced speech waveforms are highly correlated, owing to the concentration of low-frequency energy in voiced sounds, this value is close to 1 [5]. For unvoiced speech, on the other hand, the correlation is close to zero.
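A sketch of the lag-1 normalized autocorrelation, with synthetic voiced-like and unvoiced-like frames standing in for real speech (sampling rate and frame length are illustrative):

```python
import numpy as np

def norm_autocorr_lag1(frame):
    """Normalized autocorrelation at a unit sample delay."""
    num = np.sum(frame[1:] * frame[:-1])
    den = np.sqrt(np.sum(frame[1:] ** 2) * np.sum(frame[:-1] ** 2))
    return num / den if den > 0 else 0.0

fs = 8000
t = np.arange(160) / fs
voiced_like = np.sin(2 * np.pi * 120 * t)   # strongly correlated samples
rng = np.random.default_rng(1)
unvoiced_like = rng.standard_normal(160)    # near-zero correlation
```

For a pure tone the value approaches cos(2πf/fs), which is close to 1 at low frequencies, while white noise hovers near zero.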

1.5. The Predictor Coefficient

The initial predictor coefficient, a_1, is obtained from a 12-pole linear predictive coding (LPC) analysis using the covariance method. This value can be shown to be the negative of the Fourier component of the log spectrum at a unit sample delay. The first LPC coefficient differs markedly across the voiced, unvoiced, and silence classes [6].

1.6. Normalized Prediction Error

The normalized prediction error, in dB, is

E_p = E_s − 10 log10( (1/N) [ φ(0,0) − Σ_{k=1}^{12} a_k φ(0,k) ] ),

where E_s is the log energy described above, φ(i,k) is the (i,k) term of the covariance matrix of the speech samples, and a_k are the predictor coefficients. This metric quantifies the nonuniformity of the spectrum [7].
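The paper's analysis uses the covariance method; as a compact, self-contained illustration, the sketch below uses the closely related autocorrelation method instead (solving the Toeplitz normal equations with a plain linear solve) and returns the residual energy as a fraction of the frame energy. All signals and parameters are synthetic assumptions:

```python
import numpy as np

def lpc_prediction_error(frame, order=12):
    """Fraction of frame energy left after a p-th order linear predictor.

    Autocorrelation method: solve the Toeplitz normal equations R a = r,
    then residual energy = r[0] - a . r[1:]. Returns residual / r[0].
    """
    n = len(frame)
    r = np.array([frame[: n - k] @ frame[k:] for k in range(order + 1)])
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a = np.linalg.solve(r[idx], r[1 : order + 1])
    residual = r[0] - a @ r[1 : order + 1]
    return residual / r[0]

rng = np.random.default_rng(2)
fs = 8000
t = np.arange(320) / fs
# tiny noise regularizes the otherwise rank-deficient sinusoid case
voiced_like = np.sin(2 * np.pi * 120 * t) + 0.01 * rng.standard_normal(320)
unvoiced_like = rng.standard_normal(320)
```

A nearly periodic (voiced-like) frame is predicted almost perfectly, so the normalized error is tiny; white noise is essentially unpredictable, so the error stays close to 1.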

The silence, voiced, and unvoiced segmentation is a more difficult classification problem than the two-class voice activity detection and voiced-unvoiced classifications because it is a three-class problem [8]. Previous research has looked at segmentation only in terms of silent and nonsilent frames or voiced and unvoiced frames. Furthermore, fundamental metrics computed from the speech signal, including the signal’s energy [9], zero-crossing rate, and degree of voicing periodicity, were employed to achieve the V/U classification. A single statistic from the speech signal, such as RMS energy or zero-crossing rate, may be used to detect V/U/S signals. Because the value of any one parameter often overlaps across categories, such a technique can achieve only limited accuracy, especially when the speech is not captured in a high-quality environment. V/U/S classification has long relied on measures of the periodicity of speech [10]. Because vocal fold vibration does not always result in a periodic signal, failure to recognize periodicity in voiced speech can result in a V/U/S classification mistake [11].

When it comes to SUV classification, one of the most significant considerations is the set of features to be employed. SUV classification performs best with LPC-derived cepstrum and mel-frequency cepstrum coefficients. Calculating the energy of a voice signal, on the other hand, is a very simple operation, and most algorithms rely on fundamental elements such as energy contours and zero crossings [12]. In prior and current research, unsupervised learning, zero-crossing rate, pattern recognition algorithms, cumulants, autocorrelation algorithms, spectral parameters, and combinations of two or more of these methods have all been employed to construct SUV classification systems. The three speech classes are characterized as follows:

1.6.1. Voiced Speech

When a system’s input excitation is a nearly periodic impulse sequence, the resultant speech is referred to as voiced speech, as it appears visually periodic (see Figure 1) [13].

1.6.2. Unvoiced Speech

Unvoiced speech occurs when the excitation is random and noise-like, and the resulting speech signal is likewise noise-like with no periodicity.

The graphic depicts the nature of the noise-like excitation and the resulting unvoiced speech [15]. As can be observed, the unvoiced utterance is nonperiodic; this is the major distinction between voiced and unvoiced speech. Autocorrelation analysis can also detect the nonperiodicity of unvoiced speech (see Figure 2) [16].

1.6.3. Silence

The speech production process involves the alternation of voiced and unvoiced speech, separated by silent periods [17]. During the silent phase, no excitation is delivered to the vocal tract; hence, there is no speech output. Silence is nevertheless a component of the speech signal: without quiet regions between voiced and unvoiced segments, the speech would be incomplete. Silence, combined with adjacent voiced or unvoiced segments, may be used to identify particular types of sounds [18]. Even though the silent region is negligible in terms of amplitude/energy, its duration is critical for intelligible speech.

This study proposes a novel PSO-based clustering method to categorize speech into three classes: silence, voiced, and unvoiced. Frames are grouped on the basis of the features extracted from them. Zero crossing, energy, normalized autocorrelation coefficient, predictor coefficient, and normalized prediction error are the five features extracted from the speech. Using PSO-based clustering, an audio signal is partitioned into frames, and each frame is classified according to its class based on these extracted features. Performance criteria such as accuracy, sensitivity, specificity, and confidence intervals are analyzed to show the usefulness of the proposed algorithm.
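The frame-partitioning step can be sketched as follows; the frame length and hop size are illustrative (160 samples is 20 ms at 8 kHz), not values stated in the paper:

```python
import numpy as np

def frame_signal(x, frame_len=160, hop=160):
    """Split a 1-D signal into consecutive frames; trailing samples
    that do not fill a whole frame are dropped (no padding)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

x = np.arange(1000, dtype=float)  # stand-in for an audio signal
frames = frame_signal(x)
```

Each frame then feeds the five feature extractors, yielding one 5-dimensional feature vector per frame for the clustering stage.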

Furthermore, this study is organized as follows: Section 2 presents the related work. Section 3 describes the proposed methodology. Section 4 presents the comparative analyses. Finally, Section 5 concludes this study.

2. Related Work

Many existing works address SUV classification and segmentation methods, implemented with different machine learning and clustering techniques. In [19], the authors presented a novel approach for segmenting dysarthric speech into silent, unvoiced, and voiced pieces. Short-time energy, zero-crossing rate, and linear prediction error variance are used to solve the segmentation problem. A moving-average threshold technique is presented to provide fully automated segmentation that can handle highly acute dysarthric speech with changing loudness and zero-crossing rates. The capabilities of the proposed fully automated method are validated using real-world audio signals from healthy and ataxic dysarthric speakers [20]. According to the findings of the article, the proposed classification strategy not only improved segmentation results but also gave reliable results in low-effort settings.

A voiced-unvoiced-silence classification technique based on a time-frequency representation of the measured signal, regarded as a data matrix, has also been proposed. The study [21] is based on a hierarchical dual-geometry analysis of the data matrix that exploits the tight connection between time frames and frequency bins. The method separates speech and silent frames, and then voiced and unvoiced frames, by progressively learning the associated geometry in two phases. Elsewhere, a multilayer feedforward network (MFN) was used to classify speech into voiced, unvoiced, and silence, and the network was compared against a maximum-likelihood classifier [22]. The resulting V/U/S classifier is projected to be a valuable tool for speech analysis and for mixed speech-data communication systems.

The work in [23] provided a unique voiced-unvoiced-silence categorization using unsupervised learning. Using Gaussian mixture models and the expectation-maximization approach, class-dependent statistics such as feature means, covariance matrices, and prior probabilities of the voiced, unvoiced, and silent classes are estimated directly from the signal. The NTIMIT database was used to evaluate the learning-based categorization, and the results illustrated the accuracy of a completely learned classification. To remove noise from speech signals, an improved wavelet-based speech enhancement technique with spectral speech classification has been proposed. Using a unique energy-based threshold, the technique splits speech into voiced, unvoiced, and silent sections before applying the wavelet transform. The detail coefficients are thresholded to reduce noise, taking into account the distinctive properties of speech in each of the three regions [24]. Soft thresholding is employed for voiced regions, hard thresholding is used for unvoiced regions, and the wavelet coefficients of silent regions are set to zero. The proposed technique is tested using white-noise-contaminated utterances from the SPEAR collection. The technique generated better results in terms of output SNR, PESQ score, speech waveforms, and spectrograms.

In [25], the authors presented a digital architecture for classifying noise-free speech segments in an instantaneous V/UV/S manner. The proposed architecture computes two commonly used time-domain speech metrics, short-time energy (STE) and short-time zero-crossing rate, from the incoming samples of the speech segments. The hardware required for on-the-fly calculation of these parameters is included in an algorithmic state machine with data path (ASMD), inside which the decision model is implemented as a separate unit. The architecture is prototyped on a field-programmable gate array (FPGA) using the Xilinx ZedBoard Zynq Evaluation and Development Platform (XC7Z020CLG484-1). It has a maximum operating clock frequency of 185 MHz and is fully compatible with prior CORDIC-based window designs.

In [26], a hybrid CNN with long short-term memory (LSTM) is used to automatically extract environment and microphone information from the spoken sound. The trials also examined how using voiced and unvoiced chunks of speech affected the accuracy of environment and microphone classification. The suggested method employs a subset of the KSU-DB corpus, which contains three settings, four kinds of recording equipment, 136 speakers, and 3,600 word, phrase, and speech signal recordings. In this work, a CRNN model was built that incorporates elements of both CNN and RNN models. Speech signals were converted to spectrograms and fed into the CRNN model as 2D images.

From the literature, it is found that the existing models suffer from various problems such as poor convergence speed [27, 28], getting stuck in local optima [29–31], premature convergence [32, 33], gradient vanishing [34, 35], etc. Besides the design of an efficient fitness function, the real-time application of metaheuristic techniques remains a challenge [31, 36].

2.1. Contributions of the Study

The main contributions of the study are to:

(i) Classify the speech classes as voiced, unvoiced, or silence by using particle swarm optimization (PSO)-based classification.

(ii) Measure the efficacy of the proposed algorithm with performance parameters such as accuracy, sensitivity, and specificity.


3. Methodology

In this work, a PSO-based clustering algorithm is proposed to classify the speech classes, i.e., silence, voice, and unvoiced. These classes are clustered on the basis of their extracted features. The five features that are retrieved from the speech are zero crossing, energy, normalized autocorrelation coefficient, predictor coefficient, and normalized prediction error. An audio signal is partitioned into frames and segmented according to its class by extracting these features using PSO-based clustering. The flowchart of the proposed methodology for a brief understanding is presented in Figure 3.

To illustrate the efficiency of the suggested algorithm, performance parameters like accuracy, sensitivity, specificity, and confidence intervals are evaluated.

3.1. Particle Swarm Optimization (PSO)

PSO is a population-based optimization algorithm inspired by the social behavior of flocks of birds and is often regarded as an example of evolutionary computation. In a PSO system, a swarm of particles moves across the search space, and each particle represents a potential solution to the optimization problem [37]. A particle’s location is influenced by the best position it has visited and by the best position in its neighborhood. When a particle’s neighborhood is the whole swarm, the best position in the neighborhood is the global best particle, and the resulting algorithm is known as the gbest PSO. When smaller neighborhoods are adopted, the technique is known as the lbest PSO. The optimization problem provides a fitness function that is used to assess the performance of each particle.

Each particle in the swarm carries the following attributes: x_i, the particle’s current position; y_i, the particle’s personal best position; and v_i, the particle’s current velocity.

A particle’s personal best position y_i is the best position that the particle has visited so far. Denoting the objective function by f (to be minimized), the personal best at time step t + 1 is updated as

y_i(t + 1) = y_i(t), if f(x_i(t + 1)) ≥ f(y_i(t)),
y_i(t + 1) = x_i(t + 1), if f(x_i(t + 1)) < f(y_i(t)).

lbest and gbest are two basic approaches to PSO, the distinction being the neighborhood topology used to exchange experience among particles. In the gbest model, the best particle is chosen from the whole swarm. If the vector ŷ denotes the position of the global best particle, then

f(ŷ(t)) = min{ f(y_1(t)), ..., f(y_s(t)) }, with ŷ(t) ∈ {y_1(t), ..., y_s(t)}.

Here, s is the size of the swarm. In the lbest model, the swarm is partitioned into overlapping neighborhoods of particles. The best particle in neighborhood N_j, denoted ŷ_j, is defined as

f(ŷ_j(t + 1)) = min{ f(y_i(t)) : y_i ∈ N_j }, with ŷ_j(t + 1) ∈ N_j.

Particle indices are often employed to define neighborhoods, although topological neighborhoods may also be used. The gbest model is a special case of lbest with l = s, in which the neighborhood is the whole swarm [38]. Although the lbest PSO maintains more diversity than the gbest PSO, it is also slower. The remainder of this section focuses on the quicker gbest PSO.

For each iteration of the gbest PSO, the velocity and position of particle i are updated as

v_i(t + 1) = w·v_i(t) + c1·r1(t)·[y_i(t) − x_i(t)] + c2·r2(t)·[ŷ(t) − x_i(t)],
x_i(t + 1) = x_i(t) + v_i(t + 1),

where r1(t) and r2(t) are random values drawn uniformly from [0, 1].

Here, w is the inertia weight, and c1 and c2 are acceleration constants. The velocity update consists of three components. The inertia term, w·v_i(t), is the particle’s memory of its previous velocity: a high inertia weight favors exploration, while a low inertia weight favors exploitation. The cognitive component, c1·r1(t)·[y_i(t) − x_i(t)], reflects the particle’s own knowledge of the best solution it has found. The social component, c2·r2(t)·[ŷ(t) − x_i(t)], represents the swarm’s collective belief about the best solution. Various social topologies have been studied, with the star topology being the most popular.
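The gbest update equations can be sketched in a few lines of NumPy. This is an illustrative toy run on the 2-D sphere function, not the paper's implementation; the parameter values w = 0.72 and c1 = c2 = 1.49 are common textbook choices assumed here:

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_step(x, v, pbest, gbest, w=0.72, c1=1.49, c2=1.49):
    """One gbest-PSO update; x, v, pbest are (n_particles, dim) arrays."""
    r1 = rng.random(x.shape)
    r2 = rng.random(x.shape)
    v_new = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v_new, v_new

# toy fitness: minimise f(x) = ||x||^2 in 2-D
f = lambda x: np.sum(x ** 2, axis=1)
x = rng.uniform(-5, 5, (20, 2))
v = np.zeros_like(x)
pbest, pbest_val = x.copy(), f(x)
gbest = pbest[np.argmin(pbest_val)]
for _ in range(200):
    x, v = pso_step(x, v, pbest, gbest)
    val = f(x)
    improved = val < pbest_val
    pbest[improved], pbest_val[improved] = x[improved], val[improved]
    gbest = pbest[np.argmin(pbest_val)]
```

After a few hundred iterations the global best settles very close to the optimum at the origin, illustrating the exploration/exploitation balance set by w, c1, and c2.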

To perform classification, a fitness function must first be established. Like other swarm intelligence approaches, PSO searches the space of solutions to optimize outcomes in problems with single or multiple objectives. It has been established that PSO may provide better outcomes more quickly and cheaply than other approaches, and it is also amenable to parallelization. Furthermore, it does not exploit the gradient of the problem being optimized; in other words, unlike classic optimization approaches, PSO does not require a differentiable problem. It is becoming more popular as a result of its many benefits, such as robustness, efficiency, and simplicity, and it has been found to need less computing effort than other stochastic algorithms. As a result, it is an efficient optimization strategy for classification problems.

The PSO approach repeats the update equations above until the velocity updates are close to zero or until a certain number of iterations have been completed [39]. Particle quality is assessed using a fitness function that determines the optimality of the relevant solution.


The proposed Algorithm 1 is used to cluster audio signals according to their respective classes. Its performance is measured with the following parameters.

Initialize signals with a constant value of energy and deploy in a specified area.
Initially, assign 10% of the n signals to cluster nodes at random.
For i: 1 to n
    Calculate distance (F1) = d(i, m + 1)
    Minimum distance = Euclidean distance from signal i to BS(m + 1)
    If (minimum distance > distance)
        Node.id = i
    End if
    For j: 1 to m (total number of CHs)
        Calculate distance (F2) = d(j, m) (distance from signal j to CH m)
        If (minimum distance > distance)
            Minimum distance = distance
            Node.id = j
        End if
        Store the distance in an array A which maintains values for clusters
        A(Node.id).sum = A(Node.id).sum + minimum_distance
        A(Node.id).num = A(Node.id).num + 1
    End for
End for
For k: 1 to m (total number of CHs)
    Calculate distance (F3) = d(k, m + 1)
    If (minimum distance > distance)
        Minimum distance = distance
        Cluster.id = k
    End if
    Store the distance in an array A which maintains values for clusters
    A(Cluster.id).sum = A(Cluster.id).sum + minimum_distance
    A(Cluster.id).num = A(Cluster.id).num + 1
End for
Calculate the total energy (F4)
Compute the fitness function value for each node:
For i: 1 to n
    Fitness (clustering) = (0.25 ∗ F1) + (0.25 ∗ F2) + (0.25 ∗ F3) + (0.25 ∗ F4)
End for
Stop
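To make the general idea of PSO-based clustering concrete, the following self-contained Python sketch encodes k candidate centroids in each particle and lets a gbest PSO minimize the mean distance from feature vectors to their nearest centroid. The synthetic 2-D data, swarm size, and coefficient values are all illustrative assumptions, not the paper's setup (the paper uses five speech features and MATLAB):

```python
import numpy as np

rng = np.random.default_rng(3)

def clustering_fitness(centroids, X):
    """Mean distance from each feature vector to its nearest centroid."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.min(axis=1).mean()

def pso_cluster(X, k=3, n_particles=40, iters=200, w=0.72, c1=1.49, c2=1.49):
    """Each particle is k centroids flattened into one position vector."""
    # seed particles with randomly chosen data points as centroids
    x = X[rng.integers(0, len(X), (n_particles, k))].reshape(n_particles, -1)
    v = np.zeros_like(x)
    fit = np.array([clustering_fitness(p.reshape(k, -1), X) for p in x])
    pbest, pbest_fit = x.copy(), fit.copy()
    g = pbest[np.argmin(pbest_fit)]
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        fit = np.array([clustering_fitness(p.reshape(k, -1), X) for p in x])
        better = fit < pbest_fit
        pbest[better], pbest_fit[better] = x[better], fit[better]
        g = pbest[np.argmin(pbest_fit)]
    centroids = g.reshape(k, -1)
    d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    return centroids, d.argmin(axis=1)

# three well-separated synthetic "classes" in a 2-D feature space
X = np.vstack([rng.normal(c, 0.3, (40, 2)) for c in ([0, 0], [5, 5], [0, 5])])
centroids, labels = pso_cluster(X)
```

In the paper's setting, X would hold the 5-dimensional feature vectors of the speech frames and k = 3 for silence, voiced, and unvoiced.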
3.2. Performance Metrics

The performance characteristics derived from the confusion matrix, such as accuracy, sensitivity, and specificity, are used to assess the proposed algorithm’s performance.

Accuracy: the fraction of correctly recognized subjects out of the total number of subjects:
Accuracy = (TP + TN) / (TP + TN + FP + FN).

Sensitivity: also known as recall, the proportion of positive labels correctly recognized by the system:
Sensitivity = TP / (TP + FN).

Specificity: the proportion of negative labels correctly classified by the system:
Specificity = TN / (TN + FP).

Here, TP = true positive, TN = true negative, FP = false positive, and FN = false negative.
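These three formulas reduce to a small helper; the counts below are made-up example values, not results from the paper:

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity (recall), and specificity from confusion
    matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity

acc, sens, spec = confusion_metrics(tp=95, tn=90, fp=5, fn=2)
```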

4. Performance Analyses

The proposed PSO-based clustering method is developed and tested in this section, and the results are examined. Performance measures such as accuracy, sensitivity, and specificity are examined to illustrate the usefulness of the suggested algorithm.

Figure 4 shows the wave signals of chosen 10 test audio samples.

Table 1 depicts the information of the chosen data samples. For implementation, 10 test audio samples are taken and their sample size, frequency, frame size, and frame length are extracted and displayed. The number of frames is generated by partitioning the audio signal. These frames are clustered according to their respective classes.

Table 2 exhibits the audio samples’ performance metrics such as accuracy, sensitivity, and specificity. The average accuracy of the 10 test samples is 0.9794, or 97%, the average sensitivity value is 0.9846, or 98%, and the average specificity value is 0.9692, or 96%. These numerical values of the metrics are visualized in Figure 4. It shows the accuracy, specificity, and sensitivity of the chosen 10 test audio signals.

From Table 3, it is observable that the performance parameters vary around their mean values within confidence intervals of 90% and 95%. Accuracy varies by 0.00428 and 0.00691, respectively; sensitivity varies by 0.00643 and 0.00518; and specificity, at the 90% and 95% confidence levels, varies by 0.00643 and 0.01038, respectively. Figure 5 shows the performance parameters for the tested audio signals in terms of accuracy, specificity, and sensitivity.
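The paper does not spell out its interval procedure; a standard z-based confidence interval for a sample mean is sketched below. The standard deviation and sample size are illustrative placeholders, not values from Table 3:

```python
import math

def confidence_interval_halfwidth(std_dev, n, z):
    """Half-width of a z-based confidence interval for a mean:
    z * s / sqrt(n) (normal approximation)."""
    return z * std_dev / math.sqrt(n)

# z = 1.645 for a 90% interval, z = 1.96 for 95%
hw90 = confidence_interval_halfwidth(std_dev=0.008, n=10, z=1.645)
hw95 = confidence_interval_halfwidth(std_dev=0.008, n=10, z=1.96)
```

With only 10 samples, a t-based interval would be slightly wider; the z form is shown here for simplicity.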

5. Conclusions

In this work, MATLAB 2020a is used for the implementation. The proposed particle swarm optimization (PSO)-based clustering algorithm is used to classify the three speech classes: silence, voiced, and unvoiced. These classes are clustered based on extracted features; the five features retrieved from the speech are zero crossing, energy, normalized autocorrelation coefficient, predictor coefficient, and normalized prediction error. A sample of 10 audio signals is chosen for the implementation. Each audio wave is partitioned into frames, and each frame is clustered as voiced, unvoiced, or silence. To demonstrate the effectiveness of the proposed algorithm, performance parameters such as accuracy, sensitivity, specificity, and confidence intervals are evaluated. The average accuracy of the audio samples is 97%, sensitivity is 98%, and specificity is 96%, which demonstrates that the proposed algorithm is highly accurate in clustering the speech classes. This allows a smart hearing aid to distinguish between silence, voiced, and unvoiced sounds.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.